Functions for reading data: read.table(), read.csv(), read.delim()
Functions for writing data: write.table(), write.csv()
In RStudio, use the mouse to browse to the directory where you saved the Course Materials:
Session → Set Working Directory → Choose Directory…
Before we even start the analysis, we need to be sure of where the data are located on our hard drive. The getwd() function reports the current working directory:
## [1] "/home/sangram/Documents/learn-R"
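The working directory can also be checked and changed from the console rather than the menu. The following is a minimal sketch using base R's getwd() and setwd(); tempdir() is only a stand-in for the folder where you saved the course materials.

```r
# Remember where we started so we can return afterwards
original.dir <- getwd()

# Move to another folder (tempdir() is a placeholder for your course folder)
setwd(tempdir())
getwd()

# Go back to where we started
setwd(original.dir)
```

Relative paths such as "data/patient-info.txt" are resolved against whatever getwd() returns, which is why it is worth checking first.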
The file.exists function does exactly what it says on the tin!
## [1] TRUE
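As a small illustration of file.exists(), the sketch below creates a throwaway file so that it runs anywhere. The file names are made up for the example and are not part of the course materials.

```r
# Create a small throwaway file so the example is self-contained
example.file <- tempfile(fileext = ".txt")
writeLines("ID\tAge", example.file)

# A path that was never created (hypothetical name)
missing.file <- file.path(tempdir(), "no-such-file.txt")

file.exists(example.file)   # the file we just wrote
file.exists(missing.file)   # nothing was written here
```

Checking the path before calling a read function tends to give a clearer failure than the error raised by the read itself.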
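If you are unsure of the file path, file.choose() lets you browse for the file interactively. Here the imported data are stored in an object called patients.
In the latest RStudio, there is also the option to import data directly from the File menu: File -> Import Dataset -> From CSV
For comma-separated files, specify sep="," or use the function read.csv(). Applied to our tab-delimited file, read.csv() leaves each row in a single column:
## ID.Race.Sex.Smokes.Height.Weight.State.Pet.Grade.Age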
## 1 AC/AH/001\tWhite\tMale\tNon-Smoker\t182.87\t76.57\tGeorgia\tDog\t2\t85
## 2 AC/AH/017\tWhite\tMale\tNon-Smoker\t179.12\t80.43\tMissouri\tDog\t2\t85
## 3 AC/AH/020\tBlack\tMale\tNon-Smoker\t169.15\t75.48\tPennsylvania\tNone\t2\t47
## 4 AC/AH/022\tWhite\tMale\tNon-Smoker\t175.66\t94.54\tFlorida\tCat\t1\t72
## 5 AC/AH/029\tWhite\tFemale\tNon-Smoker\t164.47\t71.78\tIowa\tNA\t2\t70
## 6 AC/AH/033\tNA\tFemale\tSmoker\t158.27\t69.9\tMaryland\tDog\t2\t66
## ID Race Sex Smokes Height Weight State Pet
## 1 AC/AH/001 White Male Non-Smoker 182.87 76.57 Georgia Dog
## 2 AC/AH/017 White Male Non-Smoker 179.12 80.43 Missouri Dog
## 3 AC/AH/020 Black Male Non-Smoker 169.15 75.48 Pennsylvania None
## 4 AC/AH/022 White Male Non-Smoker 175.66 94.54 Florida Cat
## 5 AC/AH/029 White Female Non-Smoker 164.47 71.78 Iowa <NA>
## 6 AC/AH/033 <NA> Female Smoker 158.27 69.90 Maryland Dog
## 7 AC/AH/037 White Female Non-Smoker 161.69 68.85 Pennsylvania None
## 8 AC/AH/044 White Female Non-Smoker 165.84 70.44 North Carolina None
## 9 AC/AH/045 White Male Non-Smoker 181.32 76.90 Louisiana Dog
## 10 AC/AH/048 Hispanic Male Non-Smoker 167.37 79.06 North Carolina None
## Grade Age
## 1 2 85
## 2 2 85
## 3 2 47
## 4 1 72
## 5 2 70
## 6 2 66
## 7 1 24
## 8 1 68
## 9 1 86
## 10 2 63
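The difference between the two outputs above comes down to the separator the reading function expects. The sketch below reproduces the effect on a tiny tab-delimited file with made-up values, not the course data.

```r
# Write a tiny tab-delimited file (made-up values)
demo.file <- tempfile(fileext = ".txt")
writeLines(c("ID\tSex\tAge",
             "P1\tMale\t85",
             "P2\tFemale\t70"), demo.file)

# read.csv() expects commas, so every row ends up in a single column
wrong <- read.csv(demo.file)
dim(wrong)

# read.delim() expects tabs, so the three columns are separated correctly
right <- read.delim(demo.file)
dim(right)

# read.table() gives the same shape once told about the header and separator
same <- read.table(demo.file, header = TRUE, sep = "\t")
dim(same)
```

A single column whose name contains dots where the separators should be is usually a sign that the wrong separator was used.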
Use the View() function to get a display of the data in RStudio:
## [1] "data.frame"
## [1] 10
## [1] 100
## [1] 100 10
## [1] "ID" "Race" "Sex" "Smokes" "Height" "Weight" "State"
## [8] "Pet" "Grade" "Age"
## [1] AC/AH/001 AC/AH/017 AC/AH/020 AC/AH/022 AC/AH/029 AC/AH/033 AC/AH/037
## [8] AC/AH/044 AC/AH/045 AC/AH/048 AC/AH/049 AC/AH/050 AC/AH/052 AC/AH/053
## [15] AC/AH/057 AC/AH/061 AC/AH/063 AC/AH/076 AC/AH/077 AC/AH/086 AC/AH/089
## [22] AC/AH/100 AC/AH/104 AC/AH/112 AC/AH/113 AC/AH/114 AC/AH/115 AC/AH/127
## [29] AC/AH/133 AC/AH/150 AC/AH/154 AC/AH/156 AC/AH/159 AC/AH/160 AC/AH/164
## [36] AC/AH/171 AC/AH/176 AC/AH/180 AC/AH/185 AC/AH/186 AC/AH/192 AC/AH/198
## [43] AC/AH/207 AC/AH/208 AC/AH/210 AC/AH/211 AC/AH/213 AC/AH/219 AC/AH/220
## [50] AC/AH/221 AC/AH/225 AC/AH/233 AC/AH/241 AC/AH/244 AC/AH/248 AC/AH/249
## [57] AC/SG/002 AC/SG/003 AC/SG/008 AC/SG/009 AC/SG/010 AC/SG/015 AC/SG/016
## [64] AC/SG/046 AC/SG/055 AC/SG/056 AC/SG/064 AC/SG/065 AC/SG/067 AC/SG/068
## [71] AC/SG/072 AC/SG/074 AC/SG/084 AC/SG/095 AC/SG/099 AC/SG/101 AC/SG/107
## [78] AC/SG/116 AC/SG/121 AC/SG/122 AC/SG/123 AC/SG/134 AC/SG/139 AC/SG/142
## [85] AC/SG/155 AC/SG/165 AC/SG/167 AC/SG/172 AC/SG/173 AC/SG/179 AC/SG/181
## [92] AC/SG/182 AC/SG/191 AC/SG/193 AC/SG/194 AC/SG/197 AC/SG/204 AC/SG/216
## [99] AC/SG/217 AC/SG/234
## 100 Levels: AC/AH/001 AC/AH/017 AC/AH/020 AC/AH/022 AC/AH/029 ... AC/SG/234
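The outputs above are consistent with the usual base R functions for inspecting a data frame. The sketch below shows them on a small made-up data frame called demo (a hypothetical stand-in for patients), since the exact calls are not shown in the text.

```r
# A small made-up data frame standing in for `patients`
demo <- data.frame(ID = c("P1", "P2", "P3"),
                   Height = c(182.9, 179.1, NA),
                   Weight = c(76.6, 80.4, 75.5))

class(demo)      # type of object
ncol(demo)       # number of columns
nrow(demo)       # number of rows
dim(demo)        # rows, then columns
colnames(demo)   # column names
demo$ID          # a single column, extracted with $
```

Note that from R 4.0.0 onwards text columns are read as character vectors by default, so a column like ID will only print with "Levels" if it has been converted to a factor.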
"Like families, tidy datasets are all alike but every messy dataset is messy in its own way" - Hadley Wickham (RStudio chief scientist and author of dplyr, ggplot2 and others)
You will make your life a lot easier if you keep your data tidy and organised. Before blaming R, consider whether your data are in a suitable form for analysis. The more manual manipulation you have done on the data (highlighting, formulas, copy-and-pasting), the less happy R is going to be to read it. Here are some useful links on some common pitfalls and how to avoid them
The data frame contains some NA values, which means the values are missing, a common occurrence in real data collection.
NA is a special value that can be present in objects of any type (logical, character, numeric etc.).
NA is not the same as NULL:
NULL is an empty R object.
NA is one missing value within an R object (like a data frame or a vector).
Often R functions will handle NAs gracefully:
## [1] 100
## [1] NA
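The distinction between NA and NULL is easier to see on a small vector. This is a minimal sketch with made-up values.

```r
x <- c(1.5, NA, 3.2)   # a vector with one missing value

length(x)        # NA still occupies a position, so the length is 3
is.na(x)         # TRUE where the value is missing
length(NULL)     # NULL is empty, so its length is 0
is.null(NULL)    # TRUE
mean(x)          # the missing value propagates, giving NA
```

So length() is unaffected by the missing value, whereas mean() returns NA unless told otherwise.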
However, some functions need to be told what to do with NAs, and functions often have their own arguments (like na.rm) for handling them:
Functions differ in how they treat NA values. Always check the documentation.
## [1] 167.4969
## [1] 167.4969
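Two common ways of getting a mean in the presence of missing values are the na.rm argument and na.omit(). The sketch below uses a short made-up vector rather than the course data.

```r
heights <- c(182.87, 179.12, NA, 175.66)   # made-up values with one NA

mean(heights)                  # NA, because one value is missing
mean(heights, na.rm = TRUE)    # drop the NA before averaging
mean(na.omit(heights))         # same result, removing NAs first
sum(is.na(heights))            # count the missing values
```

Both approaches give the same answer here; na.rm is usually the more direct choice when the function supports it.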
# Calculate BMI and an upper limit two standard deviations above the mean:
BMI <- patients$Weight / (patients$Height / 100)^2
upper.limit <- mean(BMI, na.rm = TRUE) + 2 * sd(BMI, na.rm = TRUE)
upper.limit
## [1] 30.9533
## [1] 22.9 25.1 26.4 30.6 26.5 27.9 26.3 25.6 23.4 28.2 28.2 NA 30.0 27.9
## [15] 24.5 22.0 25.6 31.5 23.8 NA 23.5 26.7 31.4 NA 24.6 NA 24.8 29.2
## [29] NA 24.1 25.1 28.0 29.4 28.2 23.6 26.4 NA 25.0 27.7 27.0 25.6 26.7
## [43] 24.5 26.1 23.1 28.2 26.9 NA 25.4 25.9 NA 24.8 28.2 NA 30.4 26.8
## [57] 26.0 25.2 26.9 31.7 25.6 NA 26.7 27.8 28.4 NA 31.5 27.0 30.0 26.5
## [71] 25.2 NA 26.7 25.8 NA 27.6 29.1 26.6 26.6 26.9 27.6 26.4 27.8 NA
## [85] 27.8 25.8 27.7 28.7 24.2 24.6 28.3 24.8 27.8 21.4 28.0 26.0 26.2 26.4
## [99] 27.7 NA
## ID Race Sex Smokes Height Weight State Pet Grade
## 1 AC/AH/001 White Male Non-Smoker 182.87 76.57 Georgia Dog 2
## 2 AC/AH/017 White Male Non-Smoker 179.12 80.43 Missouri Dog 2
## 3 AC/AH/020 Black Male Non-Smoker 169.15 75.48 Pennsylvania None 2
## 4 AC/AH/022 White Male Non-Smoker 175.66 94.54 Florida Cat 1
## 5 AC/AH/029 White Female Non-Smoker 164.47 71.78 Iowa <NA> 2
## 6 AC/AH/033 <NA> Female Smoker 158.27 69.90 Maryland Dog 2
## Age BMI
## 1 85 22.9
## 2 85 25.1
## 3 47 26.4
## 4 72 30.6
## 5 70 26.5
## 6 66 27.9
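The table above shows BMI stored as an extra column, rounded to one decimal place. One way to do this is sketched below on a made-up data frame called demo; the exact call used for patients is not shown in the text, so treat this as an assumption.

```r
demo <- data.frame(Height = c(182.87, 179.12, NA),    # cm
                   Weight = c(76.57, 80.43, 75.48))   # kg

# BMI = weight (kg) / height (m)^2; heights are converted from cm to m
demo$BMI <- round(demo$Weight / (demo$Height / 100)^2, 1)
head(demo)
```

A missing height or weight gives a missing BMI, which is why NA values appear in the BMI vector.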
Remember that <- is doing an assignment. The value we are assigning to our new variable is the logical (TRUE or FALSE) vector given by testing each item in BMI against the upper.limit:
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] NA FALSE FALSE FALSE FALSE FALSE TRUE FALSE NA FALSE FALSE
## [23] TRUE NA FALSE NA FALSE FALSE NA FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE NA FALSE FALSE NA FALSE FALSE NA FALSE
## [56] FALSE FALSE FALSE FALSE TRUE FALSE NA FALSE FALSE FALSE NA
## [67] TRUE FALSE FALSE FALSE FALSE NA FALSE FALSE NA FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] NA
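A comparison like this is applied element by element, and a missing BMI gives a missing result. A minimal sketch with made-up values follows; the variable name candidates and the threshold of 30 are chosen only for illustration.

```r
BMI <- c(22.9, NA, 31.5, 26.4)   # made-up values
upper.limit <- 30                # hypothetical threshold

candidates <- BMI > upper.limit  # one TRUE/FALSE/NA per element of BMI
candidates
```

The result has the same length as BMI, with NA wherever the BMI itself is NA.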
We have seen that a logical vector can be used to subset a data frame. Here, though, each NA in the logical vector produces a row of NAs in the result:
## ID Race Sex Smokes Height Weight State Pet Grade
## NA <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## 18 AC/AH/076 White Male Non-Smoker 176.22 97.67 Louisiana Cat 2
## NA.1 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## 23 AC/AH/104 White Male Smoker 169.85 90.63 Kentucky None 1
## NA.2 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.3 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.4 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.5 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.6 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.7 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.8 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## 60 AC/SG/009 White Male Non-Smoker 166.84 88.25 Vermont Dog 1
## NA.9 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.10 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## 67 AC/SG/064 White Male Non-Smoker 169.16 90.08 Illinois Cat 2
## NA.11 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.12 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.13 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## NA.14 <NA> <NA> <NA> <NA> NA NA <NA> <NA> NA
## Age BMI
## NA NA NA
## 18 26 31.5
## NA.1 NA NA
## 23 87 31.4
## NA.2 NA NA
## NA.3 NA NA
## NA.4 NA NA
## NA.5 NA NA
## NA.6 NA NA
## NA.7 NA NA
## NA.8 NA NA
## 60 43 31.7
## NA.9 NA NA
## NA.10 NA NA
## 67 44 31.5
## NA.11 NA NA
## NA.12 NA NA
## NA.13 NA NA
## NA.14 NA NA
The which function will take a logical vector and return the indices of the TRUE values, ignoring any NAs:
## [1] 18 23 60 67
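The effect of which() on subsetting can be seen on a small example. The data frame demo, the name over, and the threshold are all made up for this sketch.

```r
demo <- data.frame(ID = c("P1", "P2", "P3", "P4"),
                   BMI = c(22.9, NA, 31.5, 26.4))   # made-up values
upper.limit <- 30                                   # hypothetical threshold

over <- demo$BMI > upper.limit

demo[over, ]          # the NA in `over` yields a row of NAs
which(over)           # indices of TRUE values only; the NA is dropped
demo[which(over), ]   # just the rows that exceed the limit
```

Wrapping the logical vector in which() is therefore a simple way to avoid the rows of NAs when subsetting.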
To recap, the set of R commands we have used is:
patients <- read.delim("data/patient-info.txt")
BMI <- (patients$Weight)/((patients$Height/100)^2)
upper.limit <- mean(BMI,na.rm = TRUE) + 2*sd(BMI,na.rm = TRUE)
plot(BMI)
# Add a horizontal line:
abline(h=upper.limit)
Created and Maintained by Sangram Keshari Sahu
Licensed under CC-BY 4.0